1 Executive summary

This analysis of the LEGO dataset uses data downloaded from the course website on 07.12.2023.

The report is divided into the following chapters:

  • Libraries - presents the libraries used to prepare the report,
  • Data loading - presents code to load datasets,
  • Data introduction - presents the datasets used for analysis, including their structure, dimensions, and basic statistics,
  • Detailed analysis - presents detailed analysis of attribute values,
  • Correlation - presents the correlations between variables,
  • Trends - presents data trends over the years,
  • Forecasting - presents predictions of the number of sets in the future.

Conclusions:

  • Lego parts colors
  • Lego elements
    • The most popular element colors are black, white, red and yellow. The most popular colors of parts and of elements are very similar, because each element is a part in a specific color.
  • Minifigs
    • The most common number of parts in a minifig is 4. Minifigures are usually built from a small number of parts.
    • The most popular minifigs are “Skeleton”, “Battle Droid” and “Classic Spaceman”. The top 10 most popular minifigures include figures from the Star Wars films and the Minecraft game. This suggests that licensed collaborations are important in developing new minifigures.
  • Themes
    • The most popular themes are “Gear”, “Duplo” and “Educational and Dacta”. Collaborations such as “Star Wars” are also very popular.
  • Parts
  • Sets
    • The sets with the most parts are “World Map”, “Eiffel Tower” and “The Ultimate Battle for Chima”. The largest sets are mostly display models, intended for building rather than for play.
    • The correlation between the number of sets per year and the year is 0.879. The rapid growth over the past 20 years suggests that even more sets will be produced annually in the future.
    • The chart of the number of sets over time for the most popular themes shows several trends. Over the past few years, the company has released more and more “Books” theme sets. The number of “Gear” sets has also been growing. The number of Star Wars-themed sets has declined slightly in recent years, following the release of the ninth and final part of the film series in 2019.
    • Trends in sets over the years show that the Lego company is growing. More parts are used each year, and the average number of parts per set is increasing. The number of themes is also growing, providing a wider choice of subjects. The number of large sets with more than 1,000 parts and the maximum number of parts per set have both increased dramatically over the past few years. Only the median number of parts per set stays roughly the same over the years. This is because large sets make up only a small share of all sets produced, so they act as outliers that raise the mean and the maximum but not the median.
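
The last conclusion rests on the median being robust to a few very large values while the mean is not. A minimal sketch of that effect in base R, using made-up part counts (the vectors `parts_early` and `parts_late` are hypothetical and are not taken from the dataset):

```r
# Hypothetical part counts for ten sets released in an "early" year.
parts_early <- c(10, 20, 25, 30, 31, 35, 40, 60, 120, 150)
# The same profile, but the two largest sets are replaced by very large ones.
parts_late <- c(10, 20, 25, 30, 31, 35, 40, 60, 6000, 11000)

# Compare the three statistics discussed in the conclusions.
stats <- data.frame(
  period = c("early", "late"),
  mean   = c(mean(parts_early), mean(parts_late)),
  median = c(median(parts_early), median(parts_late)),
  max    = c(max(parts_early), max(parts_late))
)
print(stats)
```

In this toy example the mean and the maximum jump sharply between the two periods, while the median does not move, because only two of the ten values changed.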

2 Libraries

Libraries used to prepare the report.

library(knitr)
library(dplyr)
library(R.utils)
library(data.table)
library(tools)
library(stringr)
library(ggplot2)
library(plotly)
library(tidyr)
library(scales)
library(forecast)

3 Data loading

Code to load datasets from compressed files stored in a specified folder.

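folder_name <- "data"

# List the compressed CSV files; both dots are escaped so they match literally.
csv_files <- list.files(folder_name,
                        pattern = "\\.csv\\.gz$",
                        full.names = FALSE)
# Strip ".csv.gz" to get the bare dataset names (tools::file_path_sans_ext).
files_names <- file_path_sans_ext(csv_files, compression = TRUE)

# Read each file with data.table::fread (reading .gz files needs R.utils)
# and bind it to a variable named "<dataset>_df", e.g. colors_df.
for (file_name in files_names) {
  assign(paste0(file_name, "_df"),
         fread(file.path(
           folder_name,
           paste0(file_name, ".csv.gz")
         )))
}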

csv_files
##  [1] "colors.csv.gz"             "elements.csv.gz"          
##  [3] "inventories.csv.gz"        "inventory_minifigs.csv.gz"
##  [5] "inventory_parts.csv.gz"    "inventory_sets.csv.gz"    
##  [7] "minifigs.csv.gz"           "part_categories.csv.gz"   
##  [9] "part_relationships.csv.gz" "parts.csv.gz"             
## [11] "sets.csv.gz"               "themes.csv.gz"

4 Data introduction

The section below presents the datasets used for analysis, including their structure, dimensions, and basic statistics.

Total dataset size
Rows_count Columns_count Values_count NA_percentage
1446639 45 8099232 0.29%
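
The code behind this table is not shown in the report. One way such totals could be computed is sketched below in base R, assuming that the row and column counts are summed over all datasets, that the value count is the sum of rows × columns per dataset, and that the NA percentage is the share of `NA` cells among all values. The two small data frames are toy stand-ins for the loaded `*_df` tables.

```r
# Toy stand-ins for two of the loaded tables (hypothetical contents).
colors_df <- data.frame(id = c(-1, 0, 1),
                        name = c("[Unknown]", "Black", "Blue"))
themes_df <- data.frame(id = c(1, 3, 4),
                        name = c("Technic", "Competition", "Expert Builder"),
                        parent_id = c(NA, 1, 1))

datasets <- list(colors = colors_df, themes = themes_df)

# Per-dataset counts.
rows   <- sapply(datasets, nrow)
cols   <- sapply(datasets, ncol)
values <- rows * cols
na_cnt <- sapply(datasets, function(df) sum(is.na(df)))

# Totals across all datasets, in the layout of the table above.
total_size <- data.frame(
  Rows_count    = sum(rows),
  Columns_count = sum(cols),
  Values_count  = sum(values),
  NA_percentage = sprintf("%.2f%%", 100 * sum(na_cnt) / sum(values))
)
print(total_size)
```

Under these assumptions the figures in the table are consistent with the per-dataset summaries in section 4.2: the row counts sum to 1,446,639, the column counts to 45, and the NA counts visible in the summaries (23,682 in Elements and 145 in Themes) amount to about 0.29% of 8,099,232 values.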

4.1 Data structure

4.2 Datasets summaries

4.2.1 Colors

Colors dataset dimensions
Rows Columns
263 4
Colors dataset basic statistics
id name rgb is_trans
Min. : -1.0 Length:263 Length:263 Length:263
1st Qu.: 83.0 Class :character Class :character Class :character
Median :1005.0 Mode :character Mode :character Mode :character
Mean : 651.4
3rd Qu.:1070.5
Max. :9999.0
Head of Colors dataset
id name rgb is_trans
-1 [Unknown] 0033B2 f
0 Black 05131D f
1 Blue 0055BF f
2 Green 237841 f
3 Dark Turquoise 008F9B f
4 Red C91A09 f
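
Each dataset in this section is described by the same three tables: dimensions, basic statistics and head. A minimal sketch of a helper that could produce them with base R is shown below. The function name `describe_dataset` and the sample data frame are hypothetical; the report's own code for these tables is not shown.

```r
# Hypothetical helper: collect the three views used for every dataset.
describe_dataset <- function(df, n = 6) {
  list(
    dimensions = data.frame(Rows = nrow(df), Columns = ncol(df)),
    statistics = summary(df),   # per-column basic statistics
    head       = head(df, n)    # first n rows
  )
}

# Sample built from the first rows of the Colors dataset shown above.
colors_sample <- data.frame(
  id       = c(-1, 0, 1, 2, 3, 4),
  name     = c("[Unknown]", "Black", "Blue", "Green", "Dark Turquoise", "Red"),
  rgb      = c("0033B2", "05131D", "0055BF", "237841", "008F9B", "C91A09"),
  is_trans = rep("f", 6),
  stringsAsFactors = FALSE
)

overview <- describe_dataset(colors_sample)
print(overview$dimensions)
print(overview$statistics)
print(overview$head)
```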

4.2.2 Elements

Elements dataset dimensions
Rows Columns
84138 4
Elements dataset basic statistics
element_id part_num color_id design_id
Min. : 9327 Length:84138 Min. : -1.0 Min. : 1001
1st Qu.: 4259774 Class :character 1st Qu.: 8.0 1st Qu.: 18454
Median : 6057754 Mode :character Median : 28.0 Median : 41748
Mean : 5222065 Mean : 539.7 Mean : 45570
3rd Qu.: 6262024 3rd Qu.: 135.0 3rd Qu.: 75474
Max. :61532443 Max. :9999.0 Max. :107520
NA’s :23682
Head of Elements dataset
element_id part_num color_id design_id
6443403 2277c01pr0009 1 2277
6300211 67906c01 14 67908
4566309 2564 0 2564
4275423 53657 1004 53657
6194308 92926 71 28967
6229123 26561 4 26561

4.2.3 Inventories

Inventories dataset dimensions
Rows Columns
37265 3
Inventories dataset basic statistics
id version set_num
Min. : 1 Min. : 1.000 Length:37265
1st Qu.: 14424 1st Qu.: 1.000 Class :character
Median : 54379 Median : 1.000 Mode :character
Mean : 61104 Mean : 1.091
3rd Qu.: 88842 3rd Qu.: 1.000
Max. :194312 Max. :16.000
Head of Inventories dataset
id version set_num
1 1 7922-1
3 1 3931-1
4 1 6942-1
15 1 5158-1
16 1 903-1
17 1 850950-1

4.2.4 Inventory minifigs

Inventory minifigs dataset dimensions
Rows Columns
20858 3
Inventory minifigs dataset basic statistics
inventory_id fig_num quantity
Min. : 3 Length:20858 Min. : 1.000
1st Qu.: 7869 Class :character 1st Qu.: 1.000
Median : 15681 Mode :character Median : 1.000
Mean : 43010 Mean : 1.062
3rd Qu.: 66834 3rd Qu.: 1.000
Max. :194312 Max. :100.000
Head of Inventory minifigs dataset
inventory_id fig_num quantity
3 fig-001549 1
4 fig-000764 1
19 fig-000555 1
25 fig-000574 1
26 fig-000842 1
26 fig-008641 1

4.2.5 Inventory parts

Inventory parts dataset dimensions
Rows Columns
1180987 6
Inventory parts dataset basic statistics
inventory_id part_num color_id quantity is_spare img_url
Min. : 1 Length:1180987 Min. : -1.0 Min. : 1.00 Length:1180987 Length:1180987
1st Qu.: 9404 Class :character 1st Qu.: 4.0 1st Qu.: 1.00 Class :character Class :character
Median : 22838 Mode :character Median : 15.0 Median : 2.00 Mode :character Mode :character
Mean : 50849 Mean : 131.8 Mean : 3.37
3rd Qu.: 87088 3rd Qu.: 71.0 3rd Qu.: 4.00
Max. :194312 Max. :9999.0 Max. :3064.00
Head of Inventory parts dataset
inventory_id part_num color_id quantity is_spare img_url
1 48379c01 72 1 f https://cdn.rebrickable.com/media/parts/photos/1/48379c01-1-e7daa845-2671-4737-8642-3b1574308155.jpg
1 48395 7 1 f https://cdn.rebrickable.com/media/parts/photos/7/48395-7-b9152acf-2fa5-4836-a04d-5b7fd39c2406.jpg
1 stickerupn0077 9999 1 f
1 upn0342 0 1 f
1 upn0350 25 1 f
3 2343 47 1 f https://cdn.rebrickable.com/media/parts/elements/3000240.jpg

4.2.6 Inventory sets

Inventory sets dataset dimensions
Rows Columns
4358 3
Inventory sets dataset basic statistics
inventory_id set_num quantity
Min. : 35 Length:4358 Min. : 1.000
1st Qu.: 8076 Class :character 1st Qu.: 1.000
Median : 16423 Mode :character Median : 1.000
Mean : 52519 Mean : 1.813
3rd Qu.: 98685 3rd Qu.: 1.000
Max. :191576 Max. :60.000
Head of Inventory sets dataset
inventory_id set_num quantity
35 75911-1 1
35 75912-1 1
39 75048-1 1
39 75053-1 1
50 4515-1 1
50 4520-1 2

4.2.7 Minifigs

Minifigs dataset dimensions
Rows Columns
13764 4
Minifigs dataset basic statistics
fig_num name num_parts img_url
Length:13764 Length:13764 Min. : 0.000 Length:13764
Class :character Class :character 1st Qu.: 4.000 Class :character
Mode :character Mode :character Median : 4.000 Mode :character
Mean : 5.296
3rd Qu.: 5.000
Max. :156.000
Head of Minifigs dataset
fig_num name num_parts img_url
fig-000001 Toy Store Employee 4 https://cdn.rebrickable.com/media/sets/fig-000001.jpg
fig-000002 Customer Kid 4 https://cdn.rebrickable.com/media/sets/fig-000002.jpg
fig-000003 Assassin Droid, White 8 https://cdn.rebrickable.com/media/sets/fig-000003.jpg
fig-000004 Man, White Torso, Black Legs, Brown Hair 4 https://cdn.rebrickable.com/media/sets/fig-000004.jpg
fig-000005 Captain America with Short Legs 3 https://cdn.rebrickable.com/media/sets/fig-000005.jpg
fig-000006 Lloyd Avatar 5 https://cdn.rebrickable.com/media/sets/fig-000006.jpg

4.2.8 Part categories

Part categories dataset dimensions
Rows Columns
66 2
Part categories dataset basic statistics
id name
Min. : 1.00 Length:66
1st Qu.:19.25 Class :character
Median :35.50 Mode :character
Mean :35.36
3rd Qu.:51.75
Max. :68.00
Head of Part categories dataset
id name
1 Baseplates
3 Bricks Sloped
4 Duplo, Quatro and Primo
5 Bricks Special
6 Bricks Wedged
7 Containers

4.2.9 Part relationships

Part relationships dataset dimensions
Rows Columns
29977 3
Part relationships dataset basic statistics
rel_type child_part_num parent_part_num
Length:29977 Length:29977 Length:29977
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
Head of Part relationships dataset
rel_type child_part_num parent_part_num
P 3626cpr3662 3626c
P 87079pr9974 87079
P 3960pr9971 3960
R 98653pr0003 98086pr0003
R 98653pr0003 98088pat0003
R 98653pr0003 98089pat0003

4.2.10 Parts

Parts dataset dimensions
Rows Columns
52615 4
Parts dataset basic statistics
part_num name part_cat_id part_material
Length:52615 Length:52615 Min. : 1.00 Length:52615
Class :character Class :character 1st Qu.:17.00 Class :character
Mode :character Mode :character Median :41.00 Mode :character
Mean :38.91
3rd Qu.:60.00
Max. :68.00
Head of Parts dataset
part_num name part_cat_id part_material
003381 Sticker Sheet for Set 663-1 58 Plastic
003383 Sticker Sheet for Sets 618-1, 628-2 58 Plastic
003402 Sticker Sheet for Sets 310-3, 311-1, 312-3 58 Plastic
003429 Sticker Sheet for Set 1550-1 58 Plastic
003432 Sticker Sheet for Sets 357-1, 355-1, 940-1 58 Plastic
003434 Sticker Sheet for Set 575-2, 653-1, 460-1 58 Plastic

4.2.11 Sets

Sets dataset dimensions
Rows Columns
21880 6
Sets dataset basic statistics
set_num name year theme_id num_parts img_url
Length:21880 Length:21880 Min. :1949 Min. : 1 Min. : 0.0 Length:21880
Class :character Class :character 1st Qu.:2001 1st Qu.:273 1st Qu.: 3.0 Class :character
Mode :character Mode :character Median :2012 Median :497 Median : 31.0 Mode :character
Mean :2008 Mean :442 Mean : 161.4
3rd Qu.:2018 3rd Qu.:608 3rd Qu.: 139.0
Max. :2024 Max. :752 Max. :11695.0
Head of Sets dataset
set_num name year theme_id num_parts img_url
001-1 Gears 1965 1 43 https://cdn.rebrickable.com/media/sets/001-1.jpg
0011-2 Town Mini-Figures 1979 67 12 https://cdn.rebrickable.com/media/sets/0011-2.jpg
0011-3 Castle 2 for 1 Bonus Offer 1987 199 0 https://cdn.rebrickable.com/media/sets/0011-3.jpg
0012-1 Space Mini-Figures 1979 143 12 https://cdn.rebrickable.com/media/sets/0012-1.jpg
0013-1 Space Mini-Figures 1979 143 12 https://cdn.rebrickable.com/media/sets/0013-1.jpg
0014-1 Space Mini-Figures 1979 143 2 https://cdn.rebrickable.com/media/sets/0014-1.jpg

4.2.12 Themes

Themes dataset dimensions
Rows Columns
468 3
Themes dataset basic statistics
id name parent_id
Min. : 1.0 Length:468 Min. : 1.0
1st Qu.:250.5 Class :character 1st Qu.:186.0
Median :466.0 Mode :character Median :411.0
Mean :433.5 Mean :360.6
3rd Qu.:625.2 3rd Qu.:512.5
Max. :752.0 Max. :697.0
NA’s :145
Head of Themes dataset
id name parent_id
1 Technic
3 Competition 1
4 Expert Builder 1
16 RoboRiders 1
17 Speed Slammers 1
18 Star Wars 1

5 Detailed analysis

5.1 Colors

5.1.2 Distribution of colors by transparency

5.2 Elements

5.3 Minifigs

5.4 Themes

5.5 Parts

5.6 Sets

5.6.1 Sets with the most parts

Sets with the most parts
Name Number of parts
World Map 11695
Eiffel Tower 10001
The Ultimate Battle for Chima 9987
Titanic 9092
Colosseum 9036
Millennium Falcon 7541
AT-AT 6785
The Razor Crest 6194
Lord of the Rings: Rivendell 6182
NINJAGO City Markets 6163
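
A ranking like this can be obtained by sorting the Sets dataset by `num_parts` in descending order and keeping the first rows. The sketch below uses base R and a small sample with values taken from the tables in this report; the helper name `top_sets` is hypothetical, and the report's own code may differ.

```r
# Small sample of the Sets dataset (names and part counts from this report).
sets_sample <- data.frame(
  name      = c("Gears", "World Map", "Titanic", "Eiffel Tower",
                "Town Mini-Figures"),
  num_parts = c(43, 11695, 9092, 10001, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n sets with the most parts, largest first.
top_sets <- function(sets, n = 10) {
  ordered <- sets[order(-sets$num_parts), c("name", "num_parts")]
  head(ordered, n)
}

top3 <- top_sets(sets_sample, n = 3)
print(top3, row.names = FALSE)
```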

6 Correlation

6.1 Total number of colors used in the sets and the year

Correlation between the total number of colors and the year
Year Number of colors
Year 1.000 0.823
Number of colors 0.823 1.000
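
Counting colors per year requires linking three datasets: Sets (`set_num`, `year`), Inventories (`id`, `set_num`) and Inventory parts (`inventory_id`, `color_id`). The sketch below shows one possible way to do this in base R on toy data. It assumes that every inventory version is counted and that "number of colors" means the number of distinct `color_id` values per year; the report's actual join logic is not shown and may differ.

```r
# Toy tables with the key columns described in section 4.2 (made-up rows).
sets <- data.frame(set_num = c("a-1", "b-1", "c-1"),
                   year    = c(2000, 2000, 2001))
inventories <- data.frame(id      = 1:3,
                          set_num = c("a-1", "b-1", "c-1"))
inventory_parts <- data.frame(inventory_id = c(1, 1, 2, 3, 3, 3),
                              color_id     = c(0, 4, 0, 0, 1, 4))

# Sets -> inventories -> inventory parts.
set_inv    <- merge(sets, inventories, by = "set_num")
set_colors <- merge(set_inv, inventory_parts,
                    by.x = "id", by.y = "inventory_id")

# Number of distinct colors used in each year.
colors_per_year <- aggregate(color_id ~ year, data = set_colors,
                             FUN = function(x) length(unique(x)))
names(colors_per_year)[2] <- "n_colors"
print(colors_per_year)
```

The resulting `year` and `n_colors` columns can then be passed to `cor()` to obtain a matrix like the one above.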

6.2 Total number of parts per year and year

Correlation between total number of parts per year and year of production
Year Total number of parts
Year 1.000 0.806
Total number of parts 0.806 1.000

6.3 Total number of sets and year

Correlation between total number of sets and year of production
Year Total number of sets
Year 1.000 0.879
Total number of sets 0.879 1.000
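
The correlation matrices in this chapter are 2 × 2 because each one compares the year with a single yearly aggregate. A minimal sketch for the number of sets per year, using base R and made-up data (the default Pearson coefficient of `cor()` is an assumption, since the report does not name the method):

```r
# Toy Sets data: 1, 3, 2 and 4 sets released in 2000-2003 (made-up values).
sets_sample <- data.frame(
  set_num = sprintf("set-%02d", 1:10),
  year    = c(2000, rep(2001, 3), rep(2002, 2), rep(2003, 4))
)

# Count the sets released in each year.
sets_per_year <- aggregate(set_num ~ year, data = sets_sample, FUN = length)
names(sets_per_year)[2] <- "n_sets"

# Pearson correlation matrix between the year and the yearly count.
corr_matrix <- round(cor(sets_per_year[, c("year", "n_sets")]), 3)
print(corr_matrix)
```

For this toy data the off-diagonal value is 0.8; the diagonal is always 1 because each variable is perfectly correlated with itself.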

8 Forecasting

8.1 Forecast of the number of big sets

Projected number of large sets over 1,000 parts in the next few years
Year Number of sets with more than 1,000 parts
2023 83.09
2024 93.22
2025 103.94
2026 113.44
2027 123.62
2028 133.77
2029 143.78
2030 153.79
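
The report loads the `forecast` package, but the model used for this projection is not shown. As a stand-in, the sketch below fits a simple linear trend with base R's `lm()` and extrapolates it with `predict()`. This is a swapped-in technique, not the report's method, and the historical counts in `big_sets` are made up for illustration.

```r
# Hypothetical yearly counts of sets with more than 1,000 parts.
big_sets <- data.frame(
  year  = 2015:2022,
  n_big = c(20, 24, 31, 33, 41, 44, 52, 55)
)

# Fit a linear trend: n_big = a + b * year.
trend <- lm(n_big ~ year, data = big_sets)

# Extrapolate the fitted trend to the forecast horizon used in the table.
future <- data.frame(year = 2023:2030)
future$n_big <- round(predict(trend, newdata = future), 2)
print(future, row.names = FALSE)
```

A linear trend produces constant year-to-year increments and fractional values, which is also the pattern visible in the table above. A time-series model from the `forecast` package would additionally provide prediction intervals.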